In this document, we analyse missing data patterns in the dataset. The dataset was obtained as described in the document “Download subsample of pollution data” (i.e. using the europollution package and downloading data for a random sample of 10% of the stations, from 2013 to 2018). As a first step, we focus on one pollutant only, PM10. The present document is built so that it can be easily modified and recomputed for other pollutants or another sample.

Location of measurement stations

To get an informal, visual sense of how representative the subsample is, we look at the locations of the selected stations:

The sample seems to cover most of Europe and should therefore be reasonably representative. Note that the map above does not show the full set of stations: for readability, it is zoomed in. For more details, one can refer to the following interactive map.

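We also consider the number of stations per country. In the table below, n_stations is the number of stations in the country, n_stations_subsample is the number of stations of that country in the subsample, share_stations is the proportion of all stations located in the country, and share_subsample is the proportion of subsample stations located in the country:

country_iso n_stations n_stations_subsample share_stations share_subsample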
AD 9 8 0.0002019 0.0019222
AL 53 0 0.0011892 0.0000000
AT 2025 235 0.0454382 0.0564632
BA 105 12 0.0023561 0.0028832
BE 1821 172 0.0408607 0.0413263
BG 299 38 0.0067092 0.0091302
CH 307 25 0.0068887 0.0060067
CY 90 6 0.0020195 0.0014416
CZ 1867 247 0.0418929 0.0593465
DE 5475 454 0.1228515 0.1090822
DK 207 5 0.0046448 0.0012013
EE 112 47 0.0025131 0.0112926
ES 6143 467 0.1378405 0.1122057
FI 422 16 0.0094691 0.0038443
FR 5926 658 0.1329713 0.1580971
GB 2402 261 0.0538976 0.0627102
GE 56 0 0.0012566 0.0000000
GI 29 0 0.0006507 0.0000000
GR 288 29 0.0064623 0.0069678
HR 159 0 0.0035677 0.0000000
HU 281 39 0.0063053 0.0093705
IE 320 23 0.0071804 0.0055262
IS 132 26 0.0029619 0.0062470
IT 4974 483 0.1116097 0.1160500
LT 187 0 0.0041960 0.0000000
LU 79 6 0.0017727 0.0014416
LV 129 31 0.0028946 0.0074483
ME 26 6 0.0005834 0.0014416
MK 138 9 0.0030965 0.0021624
MT 112 0 0.0025131 0.0000000
NL 742 56 0.0166495 0.0134551
NO 380 40 0.0085267 0.0096108
PL 2633 241 0.0590809 0.0579049
PT 553 37 0.0124086 0.0088900
RO 1312 121 0.0294395 0.0290726
RS 154 0 0.0034555 0.0000000
SE 1556 61 0.0349145 0.0146564
SI 206 0 0.0046224 0.0000000
SK 227 58 0.0050936 0.0139356
TR 2579 245 0.0578692 0.0588659
XK 51 0 0.0011444 0.0000000
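
A complementary way to read this table is the per-country sampling rate, i.e. the number of stations in the subsample divided by the number of stations in the country. The following is a minimal Python sketch, not the code used to produce the table: it uses a few rows copied from the table above, and the name `sampling_rates` is purely illustrative.

```python
# A few rows copied from the table above:
# country ISO -> (stations in country, stations in subsample)
counts = {
    "AD": (9, 8),
    "DE": (5475, 454),
    "FR": (5926, 658),
    "SE": (1556, 61),
    "HR": (159, 0),
}

def sampling_rates(counts):
    """Fraction of each country's stations that ended up in the subsample."""
    return {
        iso: sampled / total
        for iso, (total, sampled) in counts.items()
        if total > 0  # skip countries without any station to avoid dividing by zero
    }

rates = sampling_rates(counts)

# Print countries from the lowest to the highest sampling rate.
for iso, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{iso}: {rate:.3f}")
```

On these rows the rate ranges from 0 (HR) to roughly 0.89 (AD), so a country's weight in the subsample can differ noticeably from its weight in the full set of stations.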

Missing data patterns in the subsample

Here, we investigate whether the share of missing PM10 concentration data varies across different dimensions.

Across countries

The overall share of missing values is 0.3781148, i.e. about 38%.

We can break this down by country.

country_iso share_missing
AT 0.0260500
MK 0.0313078
AD 0.0324936
BE 0.0376590
GB 0.0652766
FR 0.0693536
NL 0.0797792
NO 0.0924395
TR 0.1005766
LU 0.1286677
FI 0.1376533
DE 0.1531009
SK 0.1845098
PT 0.1921190
BA 0.2180321
CZ 0.3097139
SE 0.3207205
IS 0.4095204
ES 0.4204464
PL 0.4981073
GR 0.5257783
IT 0.7241246
EE 0.9399007
HU 0.9599505
ME 0.9601598
BG 0.9615132
LV 0.9619673
IE 0.9655351
CH 0.9662758
DK 0.9670484
RO 0.9713371
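
The two quantities reported above (an overall share and a per-country share, sorted in increasing order) can be sketched as follows. This is an illustration under assumptions, not the original computation: the records, the country codes `AA`, `BB`, `CC` and the function names are hypothetical, and a missing measurement is assumed to be encoded as `None`.

```python
from collections import defaultdict

# Hypothetical records: (country code, PM10 concentration or None if missing)
records = [
    ("AA", 12.0), ("AA", 15.5), ("AA", None), ("AA", 9.1),
    ("BB", None), ("BB", None), ("BB", 30.2),
    ("CC", None), ("CC", None),
]

def share_missing(values):
    """Fraction of entries that are None."""
    values = list(values)
    return sum(v is None for v in values) / len(values)

def share_missing_by_group(records):
    """Share of missing values per group key, sorted by increasing share."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    shares = {key: share_missing(vals) for key, vals in groups.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1]))

overall = share_missing(value for _, value in records)
by_country = share_missing_by_group(records)
```

Note that the overall share is weighted by the number of observations in each country, so it generally differs from the unweighted mean of the country-level shares.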

The share of missing values varies drastically across countries, from under 3% in AT to over 97% in RO. This may reflect differences in the quality of air pollution monitoring across countries, but it might also be an artifact of our data wrangling. We need to check that.

Across dates and time

Overall, along the time dimension, data seem to be missing almost at random, apart from a decreasing trend in the proportion of missing data.

One may notice that the share of missing data decreased slightly over time. This might be due to improvements in air quality measurement capacity.

We then look at whether this trend is driven by some countries more than others.

Zooming in and looking at monthly data, we can see that this decreasing trend is somewhat stepwise, with marked drops between December and January.
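
One way to quantify these year-end steps is to compute the monthly share of missing data and then the change from each December to the following January. The sketch below is illustrative only: the records are hypothetical, missing values are assumed to be encoded as `None`, and the function names are not taken from the original analysis.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (timestamp, PM10 concentration or None if missing)
records = [
    (datetime(2013, 11, 30, 10), None),
    (datetime(2013, 12, 1, 10), None),
    (datetime(2013, 12, 15, 10), None),
    (datetime(2013, 12, 31, 23), 18.0),
    (datetime(2014, 1, 1, 0), 21.0),
    (datetime(2014, 1, 2, 0), None),
    (datetime(2014, 1, 3, 0), 17.5),
    (datetime(2014, 2, 1, 0), 16.0),
]

def monthly_share_missing(records):
    """Share of missing values per (year, month), in chronological order."""
    counts = defaultdict(lambda: [0, 0])  # (year, month) -> [missing, total]
    for ts, value in records:
        cell = counts[(ts.year, ts.month)]
        cell[0] += value is None
        cell[1] += 1
    return {ym: missing / total for ym, (missing, total) in sorted(counts.items())}

def december_to_january_changes(monthly):
    """For each January, its share minus the share of the preceding December."""
    changes = {}
    for (year, month), share in monthly.items():
        if month == 1 and (year - 1, 12) in monthly:
            changes[year] = share - monthly[(year - 1, 12)]
    return changes

monthly = monthly_share_missing(records)
changes = december_to_january_changes(monthly)
```

A negative value in `changes` means the share of missing data dropped at the turn of the year; comparing these values with the other month-to-month changes would show whether the steps are specific to the December to January transition.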

We now investigate whether the share of missing data varies across months of the year, days of the month and days of the week. There do not seem to be large variations.
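
These three breakdowns differ only in the calendar component used as grouping key, so they can share one helper. The following is a sketch with hypothetical records and names, assuming missing values are encoded as `None`; weekdays use Python's `isoweekday()` convention (1 for Monday, 7 for Sunday).

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (timestamp, PM10 concentration or None if missing)
records = [
    (datetime(2018, 1, 1, 8), None),   # Monday
    (datetime(2018, 1, 2, 8), 20.0),   # Tuesday
    (datetime(2018, 1, 8, 8), 22.0),   # Monday
    (datetime(2018, 2, 1, 8), None),   # Thursday
    (datetime(2018, 2, 5, 8), 19.0),   # Monday
]

def share_missing_by_key(records, key):
    """Share of missing values grouped by key(timestamp)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for ts, value in records:
        cell = counts[key(ts)]
        cell[0] += value is None
        cell[1] += 1
    return {group: missing / total for group, (missing, total) in sorted(counts.items())}

by_month = share_missing_by_key(records, lambda ts: ts.month)
by_day_of_month = share_missing_by_key(records, lambda ts: ts.day)
by_weekday = share_missing_by_key(records, lambda ts: ts.isoweekday())
```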

Hour of the day

We now investigate whether missingness patterns vary across hours of the day.

One may notice that there is much less missing data around 11pm than at other hours. This might be problematic and needs to be investigated.

This pattern seems to be widespread across countries, even though some countries do not seem to be affected.
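
To see which countries are affected, one option is to compare, for each country, the share of missing data at 11pm with the average share over the other hours. The sketch below is a hypothetical illustration: the records, the country codes `AA` and `BB`, and the function names are made up, missing values are assumed to be encoded as `None`, and 11pm is assumed to correspond to hour 23 of the timestamp.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (country code, timestamp, PM10 value or None if missing)
records = [
    ("AA", datetime(2018, 1, 1, 22), None),
    ("AA", datetime(2018, 1, 1, 23), 14.0),
    ("AA", datetime(2018, 1, 2, 22), None),
    ("AA", datetime(2018, 1, 2, 23), 15.0),
    ("BB", datetime(2018, 1, 1, 22), 11.0),
    ("BB", datetime(2018, 1, 1, 23), None),
    ("BB", datetime(2018, 1, 2, 22), None),
    ("BB", datetime(2018, 1, 2, 23), 12.0),
]

def share_missing_by_country_hour(records):
    """Share of missing values for each (country, hour of day) pair."""
    counts = defaultdict(lambda: [0, 0])  # (country, hour) -> [missing, total]
    for iso, ts, value in records:
        cell = counts[(iso, ts.hour)]
        cell[0] += value is None
        cell[1] += 1
    return {key: missing / total for key, (missing, total) in counts.items()}

def hour_gap(shares, hour):
    """Per country: mean share over the other hours minus the share at `hour`."""
    gaps = {}
    for iso in sorted({country for country, _ in shares}):
        others = [s for (c, h), s in shares.items() if c == iso and h != hour]
        if others and (iso, hour) in shares:
            gaps[iso] = sum(others) / len(others) - shares[(iso, hour)]
    return gaps

shares = share_missing_by_country_hour(records)
gaps = hour_gap(shares, hour=23)
```

A large positive gap flags a country where data are missing much less often at 11pm than at other hours (`AA` in this toy data), while a gap near zero indicates a country that does not show the pattern (`BB`).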